Abstract

In surveys of hybrid zones, dominant genetic markers are often used to identify 
individuals of hybrid origin and assign these individuals to one of several 
potential hybrid classes. Quantitative analyses that address the statistical power 
of dominant markers in such inference are scarce. In this study, dominant 
genotype data were simulated to evaluate the effects of, first, the number of loci 
analyzed, second, the magnitude of differentiation between the markers scored 
in the groups that are hybridizing, and third, the level of genotyping error asso- 
ciated with the data when assigning individuals to various parental and hybrid 
categories. The overall performance of the assignment methods was relatively 
modest at the lowest level of divergence examined (Fst ~ 0.4), but improved
substantially at higher levels of differentiation (Fst ~ 0.67 or 0.8). The effect of
genotyping error was dependent on the level of divergence between parental 
taxa, with larger divergences tempering the effects of genotyping error. These 
results highlight the importance of considering the effects of each of the vari- 
ables when assigning individuals to various parental and hybrid categories, and 
can help guide decisions regarding the number of loci employed in future 
hybridization studies to achieve the power and level of resolution desired. 
Introduction

Hybridization and genetic introgression are biological 
phenomena that have impacted the evolutionary trajec- 
tory of many taxa. Hybridization has long been viewed as 
important in the origins of many plant species (Anderson 
1949; Stebbins 1950, 1959; Grant 1971; Abbott 1992; 
Ungerer et al. 1998; Rieseberg et al. 2003; Cronn and 
Wendel 2004; Soltis and Soltis 2009) and more recently 
has also been recognized as a contributor to speciation in 
animals (Dowling and Secor 1997; Gompert et al. 2006; 
Mavarez and Linares 2008). In contrast, hybridization and
introgression may result in decreased diversity among
lineages. Rare taxa may become genetically "swamped" by 
more common taxa through introgression (Childs et al. 
1996; Levin et al. 1996; Rhymer and Simberloff 1996; 
Riley et al. 2003; Mank et al. 2004; Roberts et al. 2010; 
Rodriguez et al. 2011), or two relatively common taxa 
may simply merge into a hybrid swarm through "specia- 
tion reversal" (Seehausen et al. 1997; Seehausen 2006; 
Taylor et al. 2006). The diverse evolutionary outcomes 
that hybridization and genetic introgression can produce 
make them important factors to consider when trying to 
characterize levels and patterns of existing biodiversity, 
understand the origins of this diversity, and, when relevant,
make decisions related to conservation of taxa.

In studies of ongoing hybridization, it is often impor- 
tant to be able to assign individuals sampled from natural 
populations to a series of parental or hybrid categories. 
Advancements in molecular genetic techniques, such as 
the identification of novel classes of highly variable 
genetic markers (Schlotterer 2004; DeYoung and Honey- 
cutt 2005; Sanz et al. 2009), and the development of more 
powerful statistical methodology for analyzing genetic 
data (see Manel et al. 2005) have made major contribu- 
tions to our ability to effectively identify hybrids. 

The power that molecular markers provide in assigning 
individuals to a series of potential parental or hybrid cate- 
gories is generally recognized to be a function of (1) how 
informative the loci analyzed are and (2) the number of 
loci included in the analysis. Boecklen and Howard (1997) 
provided an initial quantitative evaluation of the power of 
molecular markers to identify hybrids. They assessed both 
dominant and codominant markers, assuming that all of 
the markers were fully diagnostic. This work suggested that 
as few as 5 diagnostic markers may be sufficient for coarse 
identifications in hybrid zones (distinguishing between 
parental and hybrid individuals). However, identifying 
markers that are known to be diagnostic in natural popu- 
lations is often prohibitively difficult due to the large sam- 
ple sizes required. Difficulties in confidently identifying 
diagnostic markers are especially acute when the markers 
used are expressed in a dominant manner. 

More recently, Vaha and Primmer (2006) used data gen- 
erated by simulation to assess the efficiency of using 
codominant microsatellite markers to identify hybrids. In 
contrast to the analyses of Boecklen and Howard (1997), 
they assumed that no diagnostic markers were available for 
the taxa of interest. The levels of differentiation assessed by 
Vaha and Primmer (Fst = 0.03-0.21) are likely to be representative
of those observed among populations within a
species, or possibly between very recently diverged species 
(or subspecies). However, many cases of hybridization are 
between taxa that are more deeply divergent than those 
assessed by Vaha and Primmer (2006). Further, for many 
cases, microsatellite primers may not be available for the 
taxa of interest. In some cases, other markers can be used 
and often are preferable to dominant markers for detecting 
hybrids. However, the development and use of such markers
generally either require prior information about the
genome (e.g., microsatellites) or can be rather expensive to
generate (e.g., novel NextGen sequencing methods, such
as RADseq). In these situations, dominant markers, such as
AFLP (Vos et al. 1995) or RAPD loci (Welsh and McClel- 
land 1990; Williams et al. 1990) may be of interest, given 
that genotype data can be generated for such markers with- 
out any prior information about the genome. Indeed, a 
number of recent studies applied dominant markers to distinguish
among parental and hybrid individuals in a wide
variety of taxa, including plants (Wallace 2006; Magnussen 
and Hauser 2007; Liebst 2008; Milne and Abbott 2008; Gas- 
kin et al. 2009; Erfmeier et al. 2011), birds (Haig et al. 
2004; Helbig et al. 2005), barnacles (Tsang et al. 2008), 
reptiles (Fitzpatrick et al. 2008; Mebert 2008), amphibians 
(Yamazaki et al. 2008), ticks (Araya-Anchetta et al. 2013), 
butterflies (Kronforst et al. 2006; Isaza et al. 2012), and 
fishes (Young et al. 2001; Huang et al. 2005; Yamazaki 
et al. 2005; Albert et al. 2006; Oliveira et al. 2006). These 
applications of dominant markers have occurred in spite of 
the fact that little quantitative assessment exists with which 
to evaluate the power of such markers in assigning individ- 
uals to various parental and hybrid categories. The num- 
bers of dominant loci included in studies of hybridization 
vary widely, as does the information content in these loci 
(which may be reported quantitatively, qualitatively, or not 
at all). In addition, even though the potential for genotyp- 
ing error to occur when scoring dominant markers is not 
trivial (Jones et al. 1997; Perez et al. 1998; Bonin et al. 
2004; Pompanon et al. 2005), the effect of such error rates 
on inferences about hybridization has not been empirically 
evaluated. 

In this study, we assess the performance of dominant 
markers in correctly assigning individuals to hybrid cate- 
gories, considering various levels of divergence and vari- 
ous numbers of loci. In addition, we evaluate the effects 
of different levels of genotyping error on the inferences 
drawn. The levels of divergence and genotyping error and 
the number of loci assessed are chosen to represent those 
that are likely to be observed in studies employing domi- 
nant markers for analysis of hybridization among species. 

Methods 

Simulation of parental and hybrid 
genotypes 

Dominant fingerprint data were simulated in R 2.14.0 (R 
Development Core Team 2011) by assuming divergence 
of descendant populations from an initial ancestral popu- 
lation fixed for the dominant allele (denoted "1") at each 
locus. We assumed divergence into 2 independent populations
of equal size (Ne = 2 X 10^). After divergence,
allele frequencies at each locus were simulated by model- 
ing the effects of mutation, with constant underlying 
mutation rate (2 x 10^^), and drift. Because the proba- 
bility of mutation resulting in a recessive allele (repre- 
sented by a change from 1 to 0 in the simulation) is 
much greater than the probability of generating a novel 
dominant allele, we simulated mutation in one direction 
only, and ignored reverse mutation (0 to 1). Populations
were sampled at a series of generation times (5 x 10^,
7.5 X 10^, 1 X 10*, and 3 x 10*" generations), and for a 
range of numbers of polymorphic loci (25, 50, 75, 100, 
and 125). The generation times selected resulted in data 
with Fst values of 0.43, 0.55, 0.67, and 0.81. These values
are representative of those that are often observed
between potentially hybridizing species (e.g., Schulte et al.
2010; Sternkopf et al. 2010; Jacquemyn et al. 2012; Vrancken 
et al. 2012). The numbers of loci were selected to represent a 
range of those often included in hybridization studies using 
dominant markers. For each divergence time/locus number 
combination, 10 replicate simulations were performed, and 
two sets of 50 parental genotypes were generated for each of 
the 10 replicates by randomly sampling from the lineages 
based on their respective simulated allele frequencies. 
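
The simulation scheme described above can be sketched as follows. The authors worked in R; this Python version is only an illustrative stand-in. The function names, the toy parameter values, and the treatment of mutation as a deterministic reduction in the dominant-allele frequency each generation are assumptions made for brevity, and the sketch does not filter for polymorphic loci as the study did.

```python
import random


def evolve_dominant_freq(ne, mu, generations, rng):
    """Track the dominant-allele ("1") frequency at one locus through time."""
    p = 1.0  # ancestral population fixed for the dominant allele
    n_copies = 2 * ne  # diploid population: 2 * Ne gene copies
    for _ in range(generations):
        p *= 1.0 - mu  # one-way mutation, 1 -> 0 only (no reverse mutation)
        # Drift: resample every gene copy from the current frequency.
        p = sum(rng.random() < p for _ in range(n_copies)) / n_copies
    return p


def sample_genotypes(freqs, n_individuals, rng):
    """Draw diploid genotypes coded as the number of dominant alleles (0-2)."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]


# Hypothetical toy values chosen for speed, not the values used in the study.
NE, MU, GENERATIONS, N_LOCI = 50, 0.01, 50, 20
rng = random.Random(1)

# The two descendant populations evolve independently from the same ancestor.
freqs_pop1 = [evolve_dominant_freq(NE, MU, GENERATIONS, rng) for _ in range(N_LOCI)]
freqs_pop2 = [evolve_dominant_freq(NE, MU, GENERATIONS, rng) for _ in range(N_LOCI)]

# Two sets of 50 parental genotypes, as in each replicate of the study.
parents1 = sample_genotypes(freqs_pop1, 50, rng)
parents2 = sample_genotypes(freqs_pop2, 50, rng)
```

Sampling the populations at later generation counts lets drift and mutation accumulate for longer, which is how the study obtained its four divergence levels.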

For each replicate, genotypes (N = 50 each) of first-generation
hybrids (F1), second-generation hybrids (F2), and
first-generation backcrosses to each parent (B x 1 and
B x 2) were also generated. F1 genotypes were obtained by
sampling directly from the allele frequencies of the parental
groups. F2 and backcross genotypes were generated by first
calculating the expected allele frequencies in the F1 generation
and then sampling from these according to patterns expected
from the independent segregation of alleles in the respective
crosses (F1 x F1, F1 x Parent 1, or F1 x Parent 2). All genotype
data were converted to phenotypes and stored as binary
matrices for analysis (a total of 200 datasets prior to the
introduction of error, representing 20 divergence/locus
combinations). Each dataset contained 300 individuals distributed
equally among the 6 parental and hybrid categories.
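
A minimal sketch of how the six classes and their dominant phenotypes might be generated from parental allele frequencies is given below. It is written in Python rather than the R used by the authors. The allele frequencies, function names, and class labels are hypothetical, and gametes from F1 parents are drawn from the expected F1 allele frequencies, following the description above.

```python
import random


def gamete(p, rng):
    """Draw one allele: 1 (dominant) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def simulate_class(freqs_a, freqs_b, n, rng):
    """Genotypes built from one gamete of pool A and one gamete of pool B."""
    return [
        [gamete(pa, rng) + gamete(pb, rng) for pa, pb in zip(freqs_a, freqs_b)]
        for _ in range(n)
    ]


def to_phenotype(genotypes):
    """Dominant scoring: a band (1) is present if any dominant allele is carried."""
    return [[1 if g > 0 else 0 for g in ind] for ind in genotypes]


# Hypothetical dominant-allele frequencies at five loci in each parental group.
p1 = [0.9, 0.1, 1.0, 0.0, 0.5]
p2 = [0.2, 0.8, 0.0, 1.0, 0.5]
f1 = [(a + b) / 2 for a, b in zip(p1, p2)]  # expected F1 allele frequencies

rng = random.Random(7)
N = 50
genotypes = {
    "P1": simulate_class(p1, p1, N, rng),
    "P2": simulate_class(p2, p2, N, rng),
    "F1": simulate_class(p1, p2, N, rng),
    "F2": simulate_class(f1, f1, N, rng),   # F1 x F1
    "Bx1": simulate_class(f1, p1, N, rng),  # F1 x Parent 1
    "Bx2": simulate_class(f1, p2, N, rng),  # F1 x Parent 2
}
phenotypes = {name: to_phenotype(g) for name, g in genotypes.items()}
```

At a locus where the dominant allele is fixed in one parent and absent in the other, every F1 shows the band, so F1 and the band-carrying parent are indistinguishable at that locus. This is one reason dominant markers carry less information per locus than codominant ones.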

Measures of differentiation 

Levels of differentiation between polymorphic markers in
the parental groups were estimated from the 100-locus
datasets using Hickory (Holsinger et al. 2002). Each dataset
was analyzed with the full model method, providing
an estimate of θ^II. The values of θ^II can be shown to
correspond directly to Wright's Fst (Song et al. 2003),
and we therefore report these values as a surrogate for Fst
in the remainder of this text. It is important to note that
the values we use and report do not necessarily represent
the average Fst of the genome as a whole, but instead
are measures of differentiation of the specific set of markers
chosen to assess hybridization. Average Fst values of
the marker sets were estimated for the set of 10 replicates
at each of the four divergence levels, corresponding to
various degrees of separation of the parental populations.
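
For intuition about what a marker-set Fst measures, the sketch below computes Wright's Fst directly from known allele frequencies in two equally sized populations. It is not the Bayesian estimator implemented in Hickory, which works from dominant phenotype data. The function name and the example frequencies are hypothetical.

```python
def wright_fst(freqs_pop1, freqs_pop2):
    """Multilocus Fst = (H_T - H_S) / H_T for two equally weighted populations."""
    h_total = 0.0   # expected heterozygosity of the pooled population
    h_within = 0.0  # mean expected heterozygosity within populations
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p_bar = (p1 + p2) / 2
        h_total += 2 * p_bar * (1 - p_bar)
        h_within += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        raise ValueError("Fst is undefined when no locus is polymorphic")
    return (h_total - h_within) / h_total


# Hypothetical dominant-allele frequencies at three loci.
fst = wright_fst([0.9, 0.2, 1.0], [0.1, 0.6, 0.0])
```

Because the sums run only over the loci supplied, the value describes that marker set rather than the genome as a whole.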

Simulation of genotyping error 

After data were simulated as described above, error was 
introduced into each dataset using R to randomly select 
cells in the matrices and replace the selected cells with the
alternate phenotype. Error rates of 1%, 3%, and 5% were
incorporated into each of the datasets to represent levels
of genotyping error that are likely to occur in dominant
marker datasets. All of the datasets that included
genotyping error (N = 200 for each of the three error
rates) were analyzed using NewHybrids, and scored for 
efficiency, accuracy, and performance using the methods 
described below. 
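
The error step can be sketched as below, in Python rather than the R used by the authors. The function name is hypothetical, and flipping an exact number of distinct cells (rather than flipping each cell independently with the stated probability) is an assumption, because the text does not specify which scheme was used.

```python
import random


def add_genotyping_error(matrix, error_rate, rng):
    """Return a copy of a 0/1 phenotype matrix with a fraction of cells flipped."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    n_cells = n_rows * n_cols
    n_errors = round(error_rate * n_cells)
    noisy = [row[:] for row in matrix]  # leave the error-free data untouched
    # Choose distinct cells so that no cell is flipped twice.
    for idx in rng.sample(range(n_cells), n_errors):
        r, c = divmod(idx, n_cols)
        noisy[r][c] = 1 - noisy[r][c]  # swap to the alternate phenotype
    return noisy


rng = random.Random(3)
clean = [[0] * 100 for _ in range(30)]  # hypothetical 30 x 100 band matrix
noisy = add_genotyping_error(clean, 0.05, rng)
```

Keeping the error-free matrix intact allows the same replicate to be reanalyzed at each error rate.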

Analyses 

The software package NewHybrids (Anderson and 
Thompson 2002) was used to probabilistically assign each 
individual into one of six parental or hybrid categories 
(P1, P2, F1, F2, B x 1, B x 2). Reference information
(option z) was included for 10% of the parental samples 
(five individuals from each of the two parental groups in 
each dataset). Assignment probabilities were estimated 
based on 50,000 MCMC sweeps after a burn-in period of 
10,000 sweeps. Uninformative (Jeffreys) priors were 
placed on both the allele frequency and admixture distri- 
butions, although a series of runs with uniform priors 
suggested that the analyses were not sensitive to the type 
of prior used. 

Following Vaha and Primmer (2006), individuals were 
assigned to a given category if their estimated posterior 
probability of assignment to that category was at least 
50%. The assignments made in each dataset were 
assessed according to three parameters - efficiency, accu- 
racy, and overall performance. The use of these parame- 
ters follows that described in Vaha and Primmer (2006). 
Efficiency is defined as the probability of correctly 
assigning an individual of a given category to that 
category (e.g., assigning an F1 hybrid to the F1 hybrid
category), while accuracy is defined as the proportion of 
individuals assigned to any given category that actually 
belong to that category. Overall performance was then 
simply calculated as the product of those two values. All 
three measures (efficiency, accuracy, and performance) 
were calculated as means across the 10 replicate datasets 
for each divergence/locus number combination at two 
levels of resolution: 1) assigning individuals into the 
broad categories of parental or hybrid and 2) assigning 
individuals to each of the six categories. In the latter 
case, results for the parental and backcross categories are 
aggregated as single combined values averaged across the 
P1/P2 and B x 1/ B x 2 categories, respectively. All 
results were plotted in R. 
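
The three measures defined above can be computed as in the following Python sketch. The function names and the toy labels are hypothetical. The posterior probabilities would come from NewHybrids output, whose file format is not reproduced here.

```python
def assign(posteriors, threshold=0.5):
    """Return the category with posterior >= threshold, or None if there is none."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= threshold else None


def score(true_labels, assigned_labels, category):
    """Efficiency, accuracy, and performance for one category."""
    pairs = list(zip(true_labels, assigned_labels))
    n_true = sum(t == category for t, _ in pairs)
    n_assigned = sum(a == category for _, a in pairs)
    n_correct = sum(t == category and a == category for t, a in pairs)
    # Efficiency: share of true members of the category that were recovered.
    efficiency = n_correct / n_true if n_true else float("nan")
    # Accuracy: share of individuals placed in the category that belong there.
    accuracy = n_correct / n_assigned if n_assigned else float("nan")
    return efficiency, accuracy, efficiency * accuracy


# Hypothetical example: four F1 and four F2 individuals.
true_labels = ["F1"] * 4 + ["F2"] * 4
assigned = ["F1", "F1", "F1", "F2", "F2", "F2", "F1", None]
f1_scores = score(true_labels, assigned, "F1")  # (0.75, 0.75, 0.5625)
```

In the toy example, efficiency and accuracy for F1 are both 0.75, yet performance is only 0.5625, which illustrates why the product is the most conservative of the three measures.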

In addition, the six category analyses were used to eval- 
uate whether misassignments in the datasets were made at 
random or whether certain classes of individuals were 
preferentially misassigned to specific categories. For the 
misassigned individuals in each category (Parental, F1, F2,
and backcross), the proportions that were assigned to 
each of the other categories (or not assigned to any cate- 
gory with a probability of 0.5 or greater) were calculated 
as an average over all of the divergence level/locus num- 
ber combinations, and the results were plotted in R. 
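
A small Python sketch of this tally is shown below. The function name and labels are hypothetical, and the sketch handles a single dataset, whereas the study averaged over all divergence level/locus number combinations and pooled P1 with P2 and B x 1 with B x 2.

```python
from collections import Counter


def misassignment_profile(true_labels, assigned_labels, category):
    """Misassignment rate for a category and where its errors ended up."""
    outcomes = [a for t, a in zip(true_labels, assigned_labels) if t == category]
    # Individuals with no posterior of at least 0.5 are recorded as "NA".
    wrong = ["NA" if a is None else a for a in outcomes if a != category]
    rate = len(wrong) / len(outcomes) if outcomes else float("nan")
    counts = Counter(wrong)
    proportions = {k: v / len(wrong) for k, v in counts.items()}
    return rate, proportions


# Hypothetical example: five true F2 individuals.
true_labels = ["F2"] * 5
assigned = ["F2", "F1", "Bx1", None, "F2"]
rate, proportions = misassignment_profile(true_labels, assigned, "F2")
```

The proportions sum to 1 over the incorrect outcomes, so they describe where errors go, while the rate describes how often errors occur.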

Results 

Assignment to parental or hybrid categories 

Efficiency, accuracy, and performance generally improved 
with increasing numbers of loci, increasing divergence 
levels, and with decreasing error rates (Fig. 1). Of the 
variables tested, the level of divergence between parental 
populations and the number of loci analyzed had signifi- 
cant impacts on all three of the performance measures. 
As each of these variables increased in magnitude, all 
three operational measures improved in value, indicating 
that there is increasing power with which to draw infer- 
ences from the dominant marker fingerprints. Genotyping 
error rates also impacted the inferences, but the magnitude
of the effects of genotyping error varied, with the
importance of genotyping error being tempered by greater
levels of divergence between parental populations. 

Assuming no genotyping error, overall performance 
(the product of efficiency and accuracy) failed to reach a 
level of 0.95 in any of the analyses at the lowest level of 
divergence. A performance value of 95% was achieved
with 100 loci at the next highest divergence, Fst = 0.55.
At the highest divergence, Fst = 0.81, a performance value
of 95% was reached using between 25 and 50 loci.

The highest rate of genotyping error assessed (5%) had 
significant impacts on inferences about hybridization at 
the lower divergence levels. For example, the performance 
value based on 125 loci decreased by 14.2 percentage points (from 91.2% with
no error to 77.0% with the highest error rate) at the lowest
level of divergence, Fst = 0.43. In contrast, the comparable
performance values measured at the highest divergence,
Fst = 0.81, dropped by only 0.5 percentage points (from 100% with no error
to 99.5% with 5% genotyping error) (Fig. 1). 

Assignment to individual categories 

Assignment of individuals to a more refined set of categories
was not nearly as successful as simply distinguishing
parental individuals from hybrids. Overall, assessments of
efficiency, accuracy, and performance in the assignment
of individuals to specific parental and hybrid categories
(P1, P2, F1, F2, B x 1, and B x 2) revealed patterns that
are generally consistent with those observed when assignments
were made to the broader parental or hybrid categories.
Measures of performance generally increased as
locus number and divergence level increased, and as error
rate decreased (Figs 2-5). As a whole, performance levels
(including all three measures) were better when assigning
parental and F1 individuals than when assigning F2 and
backcross individuals.

Figure 1. Average efficiency (circles), accuracy (triangles), and performance (squares) values for assignments of individuals into parental or hybrid
categories. Assignments are based on a range of locus numbers (25-125) generated across three divergence levels, and incorporating three levels
of genotyping error. A value of 0.95 is represented by the dashed line.

Figure 2. Average efficiency (circles), accuracy (triangles), and performance (squares) values for assignments of individuals into each of six
categories (P1, P2, F1, F2, B x 1, B x 2). Charts labeled "Parental" and "Backcross" represent values averaged across the P1 and P2 and B x 1
and B x 2 categories, respectively. Assignments are based on a range of locus numbers (25-125), simulated at the lowest level of divergence
analyzed in this study (Fst = 0.43). A value of 0.95 is represented by the dashed line.

Overall performance values at the lowest divergence 
level were well below 0.95 for even the most favorable
locus number/error rate combinations tested (Fig. 2). Per- 
formance values improved at the moderate and high 
divergence levels, with 100 loci sufficient to achieve per- 
formance values approaching or exceeding 0.95 for each 
category when Fst = 0.67 (Fig. 4), and 75 loci sufficient to
achieve these values at the highest divergence for all cate- 
gories except F2 (0.940, Fig. 5). With the highest rates of 
error, performance at or near a level of 0.95 can be 
achieved for most categories in only the scenario of high- 
est divergence, and with at least 100-125 loci analyzed. 



In general, performance values were highest for the 
parental and F1 categories, but were lower when evaluating
the assignment of later-generation hybrid individuals
(F2 and backcross categories). Performance for the assign- 
ment of individuals into the F2 category exceeds 0.95 in 
the highest divergence scenario, with error rates less than 
3% and at least 100 loci. 

Patterns of misassignment 

To better understand how individual assignment using
dominant markers could contribute to the misidentification
of individual groups, we examined situations in
which assignment was not performed accurately and analyzed
the patterns of misassignment. Overall, the average
proportion of misassignments was quite low for the
parental and F1 groups (2.1% and 3.64%, respectively),
but higher for the F2 and backcross categories (31.9% and
27.2%, respectively) (Fig. 6). The patterns of misassignment
are informative. Among the small percentage of parentals
misassigned, most were identified as backcross individuals, and
only a small proportion were misidentified as F1 hybrids.
For the small proportion of misassigned F1 individuals,
about half were identified as parentals, while most of the
remaining individuals were classified ambiguously. Most
misidentified F2 individuals were classified as some type of
hybrid (either F1 or backcross) and rarely as a parental,
while misidentified backcross individuals were usually classified
as parentals or, less often, as F1 individuals. Overall,
the patterns of misassignment suggest that the estimated
proportion of nonparental forms will likely be close to
the actual proportion in the population, with misassignments
from various groups balancing one another in
many cases.

Figure 3. Average efficiency (circles), accuracy (triangles), and performance (squares) values for assignments of individuals into each of six
categories (P1, P2, F1, F2, B x 1, B x 2). Charts labeled "Parental" and "Backcross" represent values averaged across the P1 and P2 and B x 1
and B x 2 categories, respectively. Assignments are based on a range of locus numbers (25-125), simulated at a divergence level of Fst = 0.55. A
value of 0.95 is represented by the dashed line.

Discussion 

This study extends previous work that has evaluated the 
power of tests used to assign individuals to various hybrid 
categories using diagnostic markers (Boecklen and How- 
ard 1997) and codominant markers at low levels of diver- 
gence (Vaha and Primmer 2006). Here, we have evaluated 
whether dominant markers have sufficient power to war- 
rant their use for hybrid assessment. In addition, this 
study addresses recommendations by Bonin et al. (2004) 
and Pompanon et al. (2005) for quantitative evaluation of 
the effects of genotyping error on overall inferences. 



Several factors determine the power of the inferences 
that can be drawn using any genetic marker. These factors 
include the information content of the specific loci used, 
the number of loci analyzed, and the rates of genotyping 
error associated with the data. We calculated three sepa- 
rate measures (efficiency, accuracy, and overall perfor- 
mance) to evaluate how these factors affect assignments 
of individuals to correct hybrid categories. It is important 
to note that overall performance is the most conservative 
of the three measures. For example, in some cases, mea- 
sures of accuracy and efficiency could each exceed 0.95, 
but the associated performance statistic (the product of 
accuracy and efficiency) fail to reach such a value. In 
most situations in which unknown individuals are to be 
assigned to various parental or hybrid categories, performance
will likely be the most relevant measure. Nevertheless,
many questions about hybridization may only
require that one of its component measures (efficiency or 
accuracy) exceed a given critical value. The most appro- 
priate measure to consider should be determined on a 
case-by-case basis given the specific goals of the study. 

Figure 4. Average efficiency (circles), accuracy (triangles), and performance (squares) values for assignments of individuals into each of six
categories (P1, P2, F1, F2, B x 1, B x 2). Charts labeled "Parental" and "Backcross" represent values averaged across the P1 and P2 and B x 1
and B x 2 categories, respectively. Assignments are based on a range of locus numbers (25-125), simulated at a divergence level of Fst = 0.67. A
value of 0.95 is represented by the dashed line.

As expected, in most cases, measures of performance
increased as the number of loci increased, as the
divergence rates between parental taxa (and thus, the
average information content of each locus concerning 
parental origin) increased, and as the rates of genotyping 
error decreased. Exceptions occurred with efficiency of 
identifying F2 individuals at the lower levels of divergence. 
Specifically, F2 efficiency decreased when increasing loci 
from 25 to 75 at the lowest divergence levels and 
increased in some cases with increasing error rates. It is 
not obvious why this anomalous pattern occurred. How- 
ever, we believe that these results are likely due to a com- 
bination of the relatively low information content in the 
loci (decrease in efficiency at low divergence level), and 
the expectation that F2 individuals best fit a pattern of a 
random expression of parental combinations of alleles, 
which is generated with increasing error rates (increased 
F2 efficiency with increased error rates). In general, per- 
formance measures had higher values when assignments 
were made to the broad parental and hybrid classes than 
when assignments were made to more specific categories. 
This was a reflection of difficulties that will always be 
inherent when attempting to distinguish among a larger 
number of hybrid categories (F1, F2, B x 1, and B x 2).

The results clearly demonstrate the importance of considering
the information content of the loci used when
attempting to assign individuals to various categories. For
example, performance measures were relatively poor at 
the lowest level of divergence examined even when ana- 
lyzing the largest numbers of loci assessed in this study. 
This suggests that using dominant markers to assign indi- 
viduals effectively to parental and hybrid categories at or 
below this level of divergence (Fst = 0.43) would require
collecting information on more than 125 polymorphic 
loci. If available, codominant markers such as microsatel- 
lites may be more appropriate for such situations, because 
they would likely be able to achieve the same performance 
with fewer loci. Alternatively, however, the relative ease of 
developing and identifying additional dominant loci that 
are sufficiently informative may be quite cost effective in 
allowing further screening.

Because each locus in the genome evolves independently,
the stochastic nature of drift will cause some loci
to be more informative than others, given a specific level
of genome-wide divergence. Although Fst is usually
thought of as representing a level of divergence for loci
from throughout the genome as a whole, the Fst that we
used in these analyses described the average divergence
across the set of polymorphic loci that were used to
evaluate the performance measures. This value will be
different from the Fst for the entire genome. Consequently,
our reported levels of Fst may actually overstate
the average genome-wide level of divergence in the parental
species. However, the Fst of the loci used, which may
not be representative of the average level of divergence
across the genome as a whole, is much more relevant in
the context of providing an indication of the power available
to make inferences regarding hybridization. Identifying
and specifically selecting the most informative loci for
use in analyses of hybridization has the potential to provide
much stronger inferences than would otherwise be
available when sampling loci at random from the genome.

Figure 5. Average efficiency (circles), accuracy (triangles), and performance (squares) values for assignments of individuals into each of six
categories (P1, P2, F1, F2, B x 1, B x 2). Charts labeled "Parental" and "Backcross" represent values averaged across the P1 and P2 and B x 1
and B x 2 categories, respectively. Assignments are based on a range of locus numbers (25-125), simulated at the highest level of divergence
analyzed in this study (Fst = 0.81). A value of 0.95 is represented by the dashed line.

Due to the importance of the information content of 
the available loci and the variability in the levels of infor- 
mation that exists among different loci within a genome, 
future studies should, whenever possible, provide quanti- 
tative measures of diversity among the potentially 
hybridizing taxa based on the specific loci used in the 
study. Alternatively, power of assignment could be tested 
on a case-by-case basis by simulating offspring of the rele- 
vant hybrid classes, beginning with the parental genotypes 
in the study. The performance of the methods in assign- 
ing individuals to the appropriate categories could then 
be evaluated based on these simulated individuals. 



Questions about repeatability of data for dominant 
markers (especially RAPD data) have been an issue of 
concern in recent years (Jones et al. 1997; Perez et al. 
1998; Bonin et al. 2004), and Crawford et al. (2012) 
recently discussed the importance of clearly reporting 
error rates associated with genotyping data. In compari- 
son with the effect of divergence levels on the parameters 
estimated in this study, the rates of error incorporated 
into the simulated datasets had relatively little impact on 
overall inferences, although they did reduce the measures 
of performance slightly, especially at the highest error 
rates. Importantly, error rates appear to have greater 
effects at lower divergence, suggesting that the increasing 
signal in the data minimizes the impact of genotyping 
error. The results suggest that simply deeming the error 
rate associated with a given dataset to be "low" (which 
generally includes up to around a 5% mismatch error 
rate) may not provide a sufficiently rigorous assessment 
of the effects of genotyping error on the inferences drawn 
in the study. Instead, they highlight the importance of 
quantitatively evaluating the effects of genotyping error 
rates on a case-by-case basis, in the context of other fac- 
tors, including the level of diversity that occurs between 
the markers scored in the hybridizing taxa. 
Figure 6. Proportions of misassigned individuals assigned to each incorrect category for each of the four classes (parental, F1, F2, and backcross).
The category "NA" includes instances in which an individual was not assigned to any category with a probability of 0.5 or greater. Data for each 
category are based on the average proportion of misassignments to each of the incorrect categories across all divergence levels and locus 
numbers. The percentages at the top of each plot indicate the average proportion of individuals in that category that were misassigned in the 
total dataset. 



The analyses of misassigned individuals indicate that 
biases often exist regarding the pattern of misassignment, 
with individuals of a given class more likely to be misas- 
signed to certain categories than others. In most cases, 
the patterns observed are not surprising, such as in the 
case where misassigned first-generation backcrosses are 
identified as the parental species involved in the back- 
cross, and vice-versa. However, the reasons for other pat- 
terns are less obvious, as in the case of Fj individuals 
being preferentially assigned as parentals. 

Previous studies that have assigned individuals to paren- 
tal and hybrid categories using dominant markers have 
been based on a wide range of numbers of loci. For exam- 
ple, searches in the literature identified studies that have 
used as few as 4 loci (Gonzalez-Perez et al. 2004) and as 
many as 657 loci (Kronforst et al. 2006). While the number 
of loci necessary to successfully examine a given situation 
will vary based on factors such as the goals of the study, the 
resolution required, and the divergence levels of the 
hybridizing taxa, it is also likely that much of the variation 
in the numbers of markers used in previous studies is due 
at least in part to a lack of appropriate studies that aim to
quantify the numbers of markers required under the different
conditions found in specific cases. The results presented
here help to provide a framework upon which decisions 
can be based when determining the number of dominant 
loci necessary to achieve the power and resolution desired 
in future studies. 